
THE INFIMNIST DATASET
=====================


1- BACKGROUND
-------------

This code produces an infinite supply of digit images derived from the
well-known MNIST dataset using pseudo-random deformations and translations.
It is a streamlined version of the code written by Loosli, Canu, and Bottou
to support the experiments reported in the paper "Training Invariant Support
Vector Machines using Selective Sampling".

A subset of the examples generated by this code is known as MNIST8M and used
to be available on <http://leon.bottou.org/papers/loosli-canu-bottou-2006>.
These files were deleted from the NEC servers around 2014. Instead of
distributing these files, we now distribute the program that generates them.

Each example is identified by a long integer index that determines the source
of the example and the transformations applied to the pattern. The examples
numbered 0 to 9999 are the standard MNIST testing examples. The examples
numbered 10000 to 69999 are the standard MNIST training examples. Each example
with index i >= 70000 is generated by applying a pseudo-random transformation
to the MNIST training example numbered 10000+((i-10000)%60000).
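
The index convention above can be sketched in a few lines of Python. This
is only an illustration of the arithmetic; the function name
"source_example" is hypothetical and is not part of the infimnist code.

```python
def source_example(i):
    """Return (kind, mnist_index) for example index i.

    kind is "test", "train", or "deformed". mnist_index is the index of
    the underlying MNIST example in the numbering used by this README.
    """
    if i < 0:
        raise ValueError("example indices are non-negative")
    if i < 10000:
        return ("test", i)
    if i < 70000:
        return ("train", i)
    # Deformed examples cycle through the 60000 training examples.
    return ("deformed", 10000 + (i - 10000) % 60000)
```

For instance, source_example(70000) gives ("deformed", 10000): the first
deformed example is derived from the first MNIST training example.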


2- DATA FILES
-------------

Six data files are located in directory "data".

* Files "t10k-images-idx3-ubyte", "t10k-labels-idx1-ubyte",
  "train-images-idx3-ubyte" and "train-labels-idx1-ubyte" are the
  pristine MNIST data files also available on the web page
  <http://yann.lecun.com/exdb/mnist>.
* File "tangVec_float_60000x28x28.bin" contains tangent vectors
  for the MNIST training images.
* File "fields_float_1522x28x28.bin" contains pseudo-random 
  vector fields used to generate the character deformations. 

All six files must be available at execution time and
reside in the same directory.
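
Before running the program, it can help to check that the directory is
complete. The following Python sketch does so using the six file names
listed above; the helper name "missing_data_files" is hypothetical and
the check only tests for presence, not for file contents.

```python
import os

# File names as listed in this README.
REQUIRED_FILES = [
    "t10k-images-idx3-ubyte",
    "t10k-labels-idx1-ubyte",
    "train-images-idx3-ubyte",
    "train-labels-idx1-ubyte",
    "tangVec_float_60000x28x28.bin",
    "fields_float_1522x28x28.bin",
]


def missing_data_files(datadir="data"):
    """Return the required file names that are absent from datadir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(datadir, name))]
```

An empty result means that all six files were found in the directory.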


3- COMPILING
------------

The supplied makefiles are standard and should work
on most machines. Please customize the variable CFLAGS
as necessary for your compiler.

Unix:    $ make
Windows: C> nmake /f NMakefile


4- USING THE EXECUTABLE INFIMNIST
---------------------------------

Synopsis:

 $ infimnist [-d <datadir>] <format> <first> <last> 

Option -d <datadir> specifies the location of the data files.
The default data directory is "data" in the current directory.
Arguments <first> and <last> define the first and last index of the range
of examples written to the standard output. Both bounds are inclusive.
Argument <format> describes the format of the produced data. Any
unambiguous prefix of the following format names is recognized:

* "patterns" produces an image file using the standard MNIST binary format.
* "labels" produces a label file using the standard MNIST binary format.
* "svmlight" produces a file suitable for SVMLight or LibSVM. 
* "vw" produces a file suitable for Vowpal Wabbit. 
* "arff" produces a sparse ARFF file suitable for Weka.
* "display" produces rudimentary ASCII art.
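
The prefix rule stated above can be illustrated with a short Python
sketch. It shows the rule as described in this README, not the actual
argument parsing code of the program; "resolve_format" is a hypothetical
name.

```python
FORMATS = ["patterns", "labels", "svmlight", "vw", "arff", "display"]


def resolve_format(prefix):
    """Return the unique format name that starts with prefix."""
    matches = [name for name in FORMATS if name.startswith(prefix)]
    if len(matches) != 1:
        # Zero matches: unknown format. Several matches: ambiguous prefix.
        raise ValueError("unknown or ambiguous format: %r" % prefix)
    return matches[0]
```

Under this rule "pat", "lab" and "svm" resolve to "patterns", "labels"
and "svmlight" respectively, which matches the examples in this section.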

Examples:

* Generating files containing the standard MNIST testing set:

  $ infimnist lab 0 9999 > test10k-labels
  $ infimnist pat 0 9999 > test10k-patterns

* Generating files containing the standard MNIST training set:

  $ infimnist lab 10000 69999 > mnist60k-labels
  $ infimnist pat 10000 69999 > mnist60k-patterns

* Generating files containing the MNIST8M training set:

  $ infimnist lab 10000 8109999 > mnist8m-labels
  $ infimnist pat 10000 8109999 > mnist8m-patterns

* Generating a LibSVM compatible MNIST8M file:

  $ infimnist svm 10000 8109999 > mnist8m-libsvm

* Showing the first five deformed digits in ASCII art:

  $ infimnist display 70000 70004
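
The label and pattern files generated above can be read back with a few
lines of Python. This sketch assumes that the output carries the usual
MNIST (IDX) header: two zero bytes, a type code (0x08 for unsigned
bytes), the number of dimensions, and then one big-endian 32-bit size
per dimension. The function name "parse_idx" is hypothetical.

```python
import struct


def parse_idx(data):
    """Parse the bytes of an MNIST IDX file; return (dims, payload)."""
    zero, type_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # One big-endian unsigned 32-bit integer per dimension.
    dims = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    payload = data[4 + 4 * ndim:]
    expected = 1
    for d in dims:
        expected *= d
    if len(payload) != expected:
        raise ValueError("payload size does not match header")
    return dims, payload
```

Under these assumptions, parsing the contents of "test10k-labels" should
give dims equal to (10000,), and parsing "test10k-patterns" should give
(10000, 28, 28).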


5- USING INFIMNIST AS A LIBRARY
-------------------------------

Files "infimnist.h" and "infimnist.c" form a self-contained
library that you can use to generate an unlimited number
of MNIST-like examples on the fly. The interface is documented
by the comments in file "infimnist.h".




